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Abstract 

We describe a parallel algorithm for solving the time-independent 3d Schrodinger equation us- 
ing the finite difference time domain (FDTD) method. We introduce an optimized paralleliza- 
tion scheme that reduces communication overhead between computational nodes. We demon- 
strate that the compute time, t, scales inversely with the number of computational nodes as 
t oc (A'^nodes)"'^'^^^'^''^^- This makes it possible to solve the 3d Schrodinger equation on extremely 
large spatial lattices using a small computing cluster. In addition, we present a new method for 
precisely determining the energy eigenvalues and wavefunctions of quantum states based on a sym- 
metry constraint on the FDTD initial condition. Finally, we discuss the usage of multi-resolution 
techniques in order to speed up convergence on extremely large lattices. 

PACS numbers: 03.65. Ge, 02.70.-c, 02.30.Jr, 02.70.Bf 
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I. INTRODUCTION 



Solving the 3d Schrodinger equation given an arbitrary potential V{r) is of great practical 
use in modern quantum physics; however, there are only a handful of potentials for which 
analytic solution is possible. In addition, any potential that does not have a high degree of 
symmetry, e.g. radial symmetry, requires solution in full 3d, making standard "point-and- 
shoot" methods [l[ for solving one-dimensional partial differential equations of little use. In 
this paper we discuss a parallel algorithm for solving the 3d Schrodinger equation given an 
arbitrary potential V{f) using the finite difference time domain (FDTD) method. 

The FDTD method has a long history of application to computational electromagnet- 
ics In the area of computational electromagnetics parallel versions of the algorithms 
have been developed and tested [6- 13]. In this paper, we discuss the application of par- 



allelized FDTD to the 3d Schrodinger equation. The standard FDTD method has been 
applied to the 3d Schrodinger equation by several authors in the past j23-2^|. Here we 
show how to efficiently parallelize the algorithm. We describe our parallel algorithm for 
finding ground and excited state wavefunctions and observables such as energy eigenvalues, 
and root-mean-squared radii. Additionally, we introduce a way to use symmetry constraints 
for determining excited state wavefunctions/energies and introduce a multi- resolution tech- 
nique that dramatically decreases compute time on large lattices. This paper is accompanied 
by an open-source release of a code that implements the algorithm detailed in this paper. 
The code uses the Message Passing Interface (MPI) protocol for message passing between 
computational nodes. 

We note that another popular method for numerical solution of the 3d Schrodinger equa- 
tion is the Diffusion Monte Carlo (DMC) technique, see 13l-ll7l| and references therein. The 
starting point for this method is the same as the FDTD method applied here, namely trans- 
formation of the Schrodinger equation to imaginary time. However, in the DMC algorithm 
the resulting "dynamical" equations are transformed into an integral Green's function form 
and then the resulting integral equation is computed using stochastic sampling. The method 
is highly inefficient unless importance sampling is used. DMC is efficiently paral- 

lelized and there are several codes which implement parallelized DMC [20-23]. The method 



is similar in many ways to the one presented herein; however, the method we use does not 
suffer from the fermion sign problem which forces DMC to use the so-called "fixed-node 
approximation" In addition, although the DMC algorithm can, in principle, be ap- 

plied to extract properties of the excited states of the system most applications to date only 
calculate the ground state wavefunction and its associated expectation values. The FDTD 
method described herein can extract both ground and excited state wavefunctions. 

The organization of the paper is as follows. In Sees. [Ill and IIIII we briefly review the ba- 
sics of the FDTD method applied to the 3d Schrodinger equation and derive the equations 
necessary to evolve the quantum-mechanical wavefunction. In Sec. II VI we discuss the possi- 
bility of imposing a symmetry constraint on the FDTD initial condition in order to pick out 
different quantum-mechanical states. In Sec. |V]we describe our strategy for parallelizing the 
FDTD evolution equations and the measurement of observables. In Sec. I VI I we introduce 
an efficient method of using lower-resolution FDTD wavefunctions as initial conditions for 
higher-resolution FDTD runs that greatly speeds up determination of high-accuracy wave- 
functions and their associated observables. In Sec. IVIII we give results for a few potentials 
including benchmarks showing how the code scales as the number of computational nodes 
is increased. Finally, in Sec. IVIIII we conclude and give an outlook for future work. 
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II. SETUP AND THEORY 



In this section we introduce the theory necessary to understand the FDTD approach for 
solving the time-independent Schrodinger equation. Here we will briefly review the basic idea 
of the FDTD method and in the next section we will describe how to obtain the discretized 
"equations of motion" . 

We are interested in solving the time-independent Schrodinger equation with a static 
potential V{f,t) = y(r) and a particle of mass m 

E„V^„(f) = ^^„(r) , (2.1) 

where ipn is a quantum-mechanical wavefunction that solves this equation, £"„ is the energy 
eigenvalue corresponding to ipn, and H = — ^^V^/2m -|- V{r) is the Hamiltonian operator. 
In order to solve this time-independent (static) problem it is efficacious to consider the 
time-dependent Schrodinger equation 

v[/(f,t). (2.2) 

A solution to (12. 2p can be expanded in terms of the basis functions of the time-independent 
problem, i.e. 

oo 

^(f,t) = 5^a„V^„(f)e-^^"*, (2.3) 

n=0 

where {a„} are expansion coefficients which are fixed by initial conditions (n = represents 
the ground state, n = 1 the first excited state, etc.) and En is the energy associated with 
each state. -"^ 

By performing a Wick rotation to imaginary time, r = it, and setting h = 1 and m = 1 
in order to simplify the notation, we can rewrite Eq. (12. 2p as 

^^(f,r) = ^V^^{f,T)-V{r}^{f,T), (2.4) 
which has a general solution of the form 

oo 

^(r,r) = J]a„^„(f)e-^"^ (2.5) 

n=0 

Since Eq < Ei < E2 < for large imaginary time r the wavefunction ^['(f^, r) will be 
dominated by the ground state wavefunction aoV'o(r)^~^°^- limit r goes to infinity 

we have 

lim *(f, r) ^ ao^o(r)e~^«^ , (2.6) 

r— >oo 

Therefore, if one evolves Eq. (12. 4p to large imaginary times one will obtain a good approxi- 
mation to the ground state wavefunction.^ 



^ The index n is understood to represent the fuU set of quantum numbers of a given state of energy En ■ In 

the degenerate case ipn is an admixture of the different degenerate states. 
^ In this context a large imaginary time is defined relative to the energy splitting between the ground state 

and the first excited state, e.g. e^^""^'^^'^ <^ 1; therefore, one must evolve to imaginary times much larger 

than l/{Ei-Eo). 
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This allows one to determine the ground state energy by numerically solving equation 
fl2.4p for large imaginary time, and then use this wavefunction to find the energy expectation 
value Eq: 

(V^olV^o) Jd^x\i)o\'^ 

However, the method is not limited to extraction of only the ground state wavefunction and 
expectation values. In the next sections we will describe two different methods that can be 
used to extract, in addition, excited state wavefunctions. 



III. THE FINITE DIFFERENCE TIME DOMAIN METHOD 

To numerically solve the Wick-rotated Schrodinger equation (12. 4 p one can approximate 
the derivatives by using discrete finite differences. For the application at hand we can, 
without loss of generality, assume that the wavefunction is real- valued as long as the potential 
is real-valued. The imaginary time derivative becomes 

— ^ X, y, z, t) ^ 3.1 

or At 

where At is some finite change in imaginary time. 

Similarly, the right hand side of equation (12. 4p becomes 

- V^\I^(f, r) - ■i/(f)\&(f, r) ^ 



2 



[^{x + Ax, y, z, t) - 2^f (x, z, r) + ^(x - Ax, y, z, r)] 



2Ax2 

-^[^(x, y + Ay, z, r) - 2^(x, y, z, r) + ^(x, y - Ay, z, r)] 



2Ay 
1 

+ y,z + Az, r) - 2^(x, y, z, r) + *(x, y,z- Az, r)] 

- V(x, y, z) [^(x, y, z, r) + ^(x, y,z,T + At)] , (3.2) 



where, in the last term, we have averaged the wavefunction in imaginary time in order to 



improve the stability of the algorithm following Tafiove [4| and Sudiarta and Geldart |45 
Note that if the potential V has singular points these have to be regulated in some way, e.g. 
by ensuring that none of the lattice points coincides with a singular point. Assuming, for 
simplicity, that the lattice spacing in each direction is the same so that a = Ax = Ay = Az 
this equation can be rewritten more compactly by defining a difference vector 



D ^ ^[1,-2,1], (3.3) 



together with a matrix-valued ^ field 



"^{x — a,y, z,t) '^{x,y — a, z,t) ^{x,y,z — a,T) 

■^{x,y,z,T) ^(x,y,z,r) '^{x,y,z,T) 
'^{x + a,y, z,t) '^{x,y + a, z,t) '^{x,y,z + a,T) 



(3.4) 
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giving 



-V^^(f, r) - 1/(f)^(f, r) 



1 ^ 1 

- (^D ■ - -V{x, y, z) \^{x, y, z, r) + ^{x, y,z,r + At)] 



(3.5) 



Rewriting equation (12. 4p with equations (13. ip and (13. 5p gives the following update equa- 
tion for \l/(x, y, z, t) in imaginary time: 



(x, y,z,T + At) = A ^(x, y, z, r) + 



BAt 
2m 



where A and B are 



l-^V{x,y,z) 
l + ^V{x,y,z) 



B 



E 



1 



D * 



l + ^V(x,y,^) 



(3.6) 



(3.7) 



and we have reintroduced the mass, m, for generality. Evolution begins by choosing a 
random 3d wavefunction as the initial condition. In practice, we use Gaussian distributed 
random numbers with an amplitude of one. The boundary values of the wavefunction are 
set to zero; however, other boundary conditions are easily implemented. ^ Note that during 
the imaginary time evolution the norm of the wavefunction decreases (see Eq. 12. 5p . so we 
additionally renormalize the wavefunction during the evolution in order to avoid numerical 
underflow. This does not affect physical observables. 

We solve Eq. (13. 6 p on a three-dimensional lattice with lattice spacing a and lattice 
sites in each direction. Note that the lattice spacing a and size L = Na should be chosen so 
that the states one is trying to determine (i) fit inside of the lattice volume, i.e. \I/rms ^ L, 
and (ii) are described with a sufficiently fine resolution, i.e. ^'rms ^ cl- Also note that 
since we use an explicit method for solving the resulting partial differential equation for the 
wavefunction, the numerical evolution in imaginary time is subject to numerical instability 
if the time step is taken too large. Performing the standard von Neumann stability analysis 
4l| one finds that At < a^/3 in order achieve stability. For a fixed lattice volume a = L/N, 
therefore. At oc N'"^ when keeping the lattice volume fixed. The total compute time scales 
as ttotai oc iV^iVtime steps and assuming A^time steps oc (Ar)~^, we find that the total compute 
time scales as ttotai oc N^. 

At any imaginary time r the energy of the state, E, can be computed via a discretized 
form of equation (12. 7p 



Ex,y,^*(a;,2/,2;,r) 



Em 



I Ell -Vix,y,z)^{x,y,z,T) 



(3.8) 



We note that parallel implementations of absorbing boundary conditions may present a bottleneck for the 



parallel calculation. Many implementations of perfectly matched layers exist 
efficient parallel implementations exist with the maximum efficiency of t 



30-32 



; however, only a few 
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compute 



the WE-PML scheme 
unparallelized solution of the Schrodinger equation 



oc iVjjQjjjg achieved using 

37| . To the best of our knowledge the PML method has only been applied to 
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Excited states are extracted by saving the full 3d wavefunction to local memory peri- 
odically, which we will call taking a "snapshot" of the wavefunction. After convergence of 
the ground state wavefunction these snapshots can be used, one by one, to extract states 
with higher-energy eigenvalues by projecting out the ground state wavefunction, then the 
first excited state wavefunction, and so on [29|. In principle, one can extract as many states 
as the number of snapshots of the wavefunction saved during the evolution. For example, 
assume that we have converged to the ground state ipQ and that we also have a snapshot 
version of the wavefunction ^E'snap taken during the evolution. To extract the first excited 
state ipi we can project out the ground state using 

1^1 > ~ |^snap> - |^0><^o|*snap> ■ (3.9) 

For this operation to give a reliable approximation to ipi the snapshot time should obey 
Tsnap ^ ^ / {E2 — El) . One can use another snapshot wavefunction that was saved and obtain 
the second excited state by projecting out both the ground state and the first excited state. 

Finally we mention that one can extract the binding energy of a state by computing its 
energy and subtracting the value of the potential at infinity 

i?Hndin,M=i?M-^|!|y|^, (3.10) 

where 14o = linir^-oo V{x, y, z) with r = a/x^ + + as usual. Note that if 14o is a 
constant, then Eq. f l3.10p simplifies to -^binding [V^] = E[ip\ — Voo- 



IV. IMPOSING SYMMETRY CONDITIONS ON THE INITIAL WAVEFUNC- 
TION 

Another way to calculate the energies of the excited states is to impose a symmetry 
constraint on the initial conditions used for the FDTD evolution. The standard evolution 
calls for a random initial wavefunction; however, if we are solving a problem that has a 
potential with sufficient symmetry we can impose a symmetry condition on the wavefunction 
in order to pick out the different states required. For example, if we were considering 
a spherically symmetric Coulomb potential then we could select only the Is, 2s, 3s, etc. 
states by requiring the initial condition to be reflection symmetric about the x, y, and z 
axes.^ This would preclude the algorithm finding any anti-symmetric states such as the 
Ip state since evolution under the Hamiltonian operator cannot break the symmetry of the 
wavefunction. Likewise to directly determine the Ip excited state one can start by making 
the FDTD initial state wavefunction anti-symmetric about one of the axes, e.g. the z-axis. 
As we will show below this provides for a fast and accurate method for determining the 
low-lying excited states. 

Notationally, we will introduce two symbols, the symmetrization operator Si and the 
anti-symmetrization operator Ai. Here i labels the spatial direction about which we are 
(anti-)symmetrizing, i.e. i G {x,y,z}. Although not required, it is implicit that we perform 



Of course, for a spherically symmetric potential a fully 3d method for solving the Schrodinger equation 
is unecessary since one can reduce the problem to solving a Id partial differential equation. We only use 
this example because of its familiarity. 
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the symmetrization about a plane with x = 0, y = 0, or z = 0, respectively. In practice 
these are implemented by initializing the lattice and then simply copying, or copying plus 
flipping the sign, elements from one half of the lattice to the other. In practice, we flnd that 
due to round-off error one should reimpose the symmetry condition periodically in order to 
guarantee that lower-energy eigenstates do not reappear during the evolution. 

V. FDTD PARALLELIZATION STRATEGY 

Parallelizing the FDTD algorithm described above is relatively straightforward. Ideally, 
one would segment the volume into M equal subvolumes and distribute them equally across 
all computational nodes; however, in this paper we will assume a somewhat simpler possi- 
bility of dividing the lattice into "slices" . Our method here will be to start with a A^^ lattice 
and slice it along one direction in space, e.g. the x direction, into M pieces where is 
divisible by M. We then send each slice of [N/M) x N"^ lattice to a separate computational 
node and have each computational node communicate boundary information between nodes 
which are evolving the sub-lattices to its right and/or left. The partitioning of the lattice is 
indicated via a 2d sketch in Fig. [1] In practice, in order to implement boundary conditions 
and synchronization of boundaries between computation nodes compactly in the code, we 
add "padding elements" to the overall lattice so that the actual lattice size is {N + 2)^. The 
outside elements of the physical lattice hold the boundary value for the wavefunction. In all 
examples below the boundary value of the wavefunction will be assumed to be zero; however, 
different types of boundary conditions are easily accomodated. When slicing the lattice in 
order to distribute the job to multiple computational nodes we keep padding elements on 
each slice so that the actual size of the slices is {N/M + 2) x (A^ + 2)^. Padding elements on 
nodes that share a boundary are used to keep them synchronized, while padding elements 
on nodes that are at the edges of the lattice hold the wavefunction boundary condition. 

In Fig. [2] we show a flow chart that outlines the basic method we use to evolve each 
node's sub-lattice in imaginary time. In the flgure each column corresponds to a separate 
computational node. Solid lines indicate the process flow between tasks and dashed lines 
indicate data flow between computational nodes. Shaded boxes indicate non-blocking com- 
munications calls that allow the process flow to continue while communications take place. 
As can be seen from Fig. |2]we have optimized each lattice update by making the flrst step in 
each update iteration a non-blocking send/receive between nodes. While this send/receive 
is happening each node can then update the interior of its sub-lattice. For example, in the 
two node case show in Fig. [T]this means that node 1 would update all sites with an x-index 
between 1 and 3 while node 2 would update sites with x-index between 6 and 8. Once 
these interior updates are complete each node then waits for the boundary communication 
initiated previously to complete, if it has not already done so. Once the boundaries have 
been synchronized, the boundary elements themselves can be updated. Going back to our 
example shown in Fig. [T]this would mean that node 1 would update all sites with x-index 
of 4 and node 2 would update all sites with an x-index of 5. 

Convergence is determined by checking the ground state binding energy periodically, e.g. 
every one hundred time steps, to see if it has changed by more than a given tolerance. 
In the code, the frequency of this check is an adjustable parameter and should be tuned 
based on the expected energy of the state, e.g. if the energy is very close to zero then 
convergence can proceed very slowly and the check frequency should be correspondingly 
larger. Parametrically the check frequency should scale as 1/Eq. 
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FIG. 1: A sketch of the partition of an original lattice into two sub-lattices that can be simulated 
on separate computational nodes. In this example, we show the partitioning of an N"^ = 8^ lattice 
into M = 2 sub-lattices of 4 x 8^ . The third dimension is suppressed for clarity. Light grey shaded 
boxes indicate sites that contain boundary value information and the dark grey shaded boxes 
indicate sites that contain information which must be synchronized between node 1 and node 2. 
White boxes indicate regions where the updated wavefunction value is stored. 

For computation of observables each computational node computes its contribution to the 
observable. Then a parallel call is placed that collects the local values computed into a central 
value stored in computational node 1. Then node 1 broadcasts the value to the other nodes 
so that all nodes are then aware of the value of the particular observable. For example, to 
compute the energy of the state as indicated in Eq. fl3.8p each computational node computes 
the portion of the sum corresponding to its sub-lattice and then these values are collected 
via a parallel sum operation to node 1 and then broadcast out to each node. Each node 
can then use this information to determine if the wavefunction evolution is complete. We 
note that the normalization of the wavefunction is done in a similar way with each node 
computing its piece of the norm, collecting the total norm to node 1, broadcasting the total 
norm to all nodes, and then each node normalizes the values contained on its sub-lattice. In 
this way computation of observables and wavefunction normalization is also parallelized in 
our approach. 

A. Scaling of our Id partitioning 

In most settings computational clusters are limited by their communication speed rather 
than by CPU speed. In order to understand how things scale we introduce two time scales: 
Atu which is the amount of time needed to update one lattice site and Atc which is the 
amount of time needed to communicate (send and receive) the information contained on 
one lattice site. Typically Ar„ < Atc unless the cluster being used is on an extremely fast 
network. Therefore, the algorithm should be optimized to reduce the amount of communi- 
cations required. 
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FIG. 2: Flow chart showing a sketch of our paraUel algorithm. Each column represents a distinct 
computational node. Solid lines are process flow lines and dashed lines indicate data flow. Shaded 
boxes indicate non-blocking communications calls that allow the process flow to continue while 
communications take place. 



For the one-dimensional partitions employed here 



Tc = 27V2ATe, (5.1) 
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where -/Vnodes the number of Id shces distibuted across the cluster and the factor of 2 
comes from the 2 surfaces which must be communicated by the internal partitions. For the 
calculation not to have a communications bottleneck we should have Tc < x^. Using (15 .ip 
we find that this constraint requires 

/ViV<i(|^)iV. (5.2) 

In the benchmarks section below we will present measurements of Atu and Atc using our 
test cluster. We find that At^ ~ Ar^/S. Using this, and assuming, as a concrete example, a 
lattice size of = 1024 we find N^^^^^ < 102. For clusters with more than 102 nodes it would 
be more efficient to perform a fully 3d partitioning. In the case of a fully 3d partitioning 
one finds that the limit due to communications overhead is A^nodes ~ 39768. 



VI. THE MULTI-RESOLUTION TECHNIQUE 



If one is interested in high-precision wavefunctions for low-lying excited states, an efficient 
way to do this is to use a multi-resolution technique. This simply means that we start with 
a random wavefunction on small lattice, e.g. 64^, and use the FDTD technique to determine 
the ground state and first few excited states and save the wavefunctions, either in local 
memory or disk. We can then use a linear combination of the coarse versions of each state 
as the initial condition on a larger lattice, e.g. 128^, while keeping the lattice volume fixed. 
We can then "bootstrap" our way up to extremely large lattices, e.g. on the order of 
1024^ — )■ 2048^, by proceeding from low resolution to high resolution. In the results section 
we will present quantitative measurements of the speed improvement that is realized using 
this technique. 



VII. RESULTS 



In this section we present results obtained for various 3d potentials and benchmarks that 
show how the code scales with the number of computational nodes. Our benchmarks were 
performed on a small cluster of 4 servers, each with two quad-core 2 GHz AMD Opteron 
processors. Each server can therefore efficiently run eight computational processes simul- 
taneously, allowing a maximum of 32 computational nodes. ^ The servers were networked 
with commercial 1 Gbit/s TCP/IP networking. For the operating system we used 64 bit 
Ubuntu Server Edition 8.10 Linux. 



A. Implementation 

In order to implement the parallel algorithm we use a mixture of C/C++ and the Message 
Passing Interface (MPI) library for message passing between computational nodes Hi. The 
servers used the OpenMPI implementation of the MPI API.^ The code itself is open-sourced 



^ Due to a server upgrade during publishing we were able to extend to 64 computational nodes in the 

general benchmark section. 
^ The code was also tested against the MPICH implementation of the MPI API with similar results. 
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FIG. 3: Time to complete one iteration of the update equation for the wavefunction in imaginary 
time as a function of the number of nodes. In (a) the dashed line is a linear fit to the data. In (b) 
the lines correspond to legend indicated. Error bars are the standard error determined by averaging 
observables over 10 runs, cJe = (j/^/N, where a is the standard deviation across the sampled set of 
runs and N is the number of runs. 



under the Gnu General Public License (GPL) and is available for internet download via the 
URL in Ref. [43 . 



B. General Benchmarks 

In this section we present data for the scaling of the time of one iteration and the time 
for communication on a A^^ = 512'^ lattice. As discussed in Sec. IV Al we expect to see ideal 
scaling of the code as long as communication time is shorter than the update time, i.e. 
Tc < Tu- In Fig. |3K we show the time to complete one iteration as a function of the number 
of computational nodes on a log- log axis along with a linear fit. The linear fit obtained gives 
^iteration oc ^i^des^^^ ^^ • In addition, in Fig. [3b we show a comparison of the full time for each 
iteration with the amount of time needed to communicate a lattice site's information (in 
this case the local value of the wavefunction). In both Fig. [3^ and [3b the error bars are the 
standard error determined by averaging over 10 runs, cXe = a/y/N, where a is the standard 
deviation across the sampled set of runs and A^ is the number of runs. 

As can be seen from Fig. [3]d using a 512^ lattice the algorithm performs well up to 
A'nodes = 64 at which point the communication time becomes equal to the iteration time. 
For A^nodes > 64 we would see a violation of the scaling above due to communication overhead. 
Note that this is rough agreement with our estimate from Sec. IV Al which, for 512'^ lattice 
predicts the point where communications and update times to be equal to be A'nodes ~ 51. 
Note that in Fig. [Hb the increase in communication times as A'nodes increases is due to 
the architecture of the cluster used for the benchmarks which has eight cores per server. If 
A^nodes < 8 then all jobs run on one server, thereby decreasing the communications overhead. 
In the next section, we will present benchmarks for different potentials in order to (a) confirm 
the scaling obtained above in specific cases and (b) to verify that the code converges to the 
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physically expected values for cases which are analytically solvable. 



C. 



Coulomb Potential Benchmarks 



We use the following potential for finding the Coulomb wavefunctions 




r < a 
r > a , 



(7.1) 



where a is the lattice spacing in units of the Bohr radius and r is the distance from the 
center of the 3d lattice. The constant of 1/a is added for r > a in order to ensure that the 
potential is continuous at r = a. This is equivalent to making the potential constant for 
r < a and shifting the entire potential by a constant which does not affect the binding energy. 
Analytically, in our natural units the binding energy of the nth state is En = — l/(2(n + l)^) 
where n > is the principal quantum number labeling each state. The ground state therefore 
has a binding energy of Eq = —1/2 and the first excited state has Ei = —1/8, etc. Note 
that to convert these to electron volts you should multiply by 27.2 eV. 

In Fig. m we show the amount of time needed in seconds to achieve convergence of the 
ground state binding energy to a part in 10^ as a function of the number of computational 
nodes for iVnodes £ {4, 8, 16, 32} on a log-log plot. For this benchmark we used a lattice 
with A^^ = 512^, a constant lattice spacing of a = 0.05, a constant imaginary time step 
of At = a^/4 = 6.25 x 10~^, and the particle mass was also set to m = 1. In order 
to remove run-by-run fluctuations due to the random initial conditions we used the same 
initial condition in all cases. In Fig. H] the error bars are the standard error determined by 
averaging over 10 runs, cTg = cr/\/N, where a is the standard deviation across the sampled 
set of runs and N is the number of runs. In all cases shown the first two energy levels 
obtained were Eq = —0.499 and Ei = —0.122. This corresponds to an accuracy of 0.2% 
and 2.4%, respectively. In Fig. Hlthe extracted scaling slope is close to 1 indicating that the 
compute time in this case scales almost ideally, i.e. inversely proportional to the number of 
computing nodes. Note that the fit obtained in Fig. H] has a slope with magnitude greater 
than 1 indicating scaling which is better than ideal; however, as one can see from the figure 
there is some uncertainty associated with this fit. 

D. 3d Harmonic Oscillator Benchmarks 

We use the following potential for finding the 3d harmonic oscillator wavefunctions 



where r is the distance from the center of the 3d lattice. 

In Fig. |5] we show the amount of time needed in seconds to achieve convergence of the 
ground state binding energy to a part in 10^ as a function of the number of computational 
nodes for A'nodes £ {4,8, 16,32}. For this benchmark we used a constant lattice spacing of 
a = 0.02, a constant imaginary time step of At = a^/A = 1.0 x 10"^, and a A^^ = 512^ 
dimension lattice so that the box dimension was L = aN = 10.24. In Fig. [5] the error bars 
are the standard error determined by averaging over 10 runs, ae = (t/\/N, where cr is the 




(7.2) 
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FIG. 4: Compute time versus number of computational nodes on a log-log plot for the Coulomb 
potential specified in Eq. (jT.ip . The dashed line is a linear fit to the data. Scaling exponent 
indicates that, in this case, the compute time scales inversely with the number of compute nodes. 
Error bars are the standard error determined by averaging over 10 runs, ag = (T/\fN , where a is 
the standard deviation across the sampled set of runs and A'^ is the number of runs. 

standard deviation across the sampled set of runs and N is the number of runs. The particle 
mass was also set to m = 1. In order to remove run- by-run fluctuations due to the random 
initial conditions we used the same initial condition in all cases. In all cases the ground state 
energy obtained was Eq = 1.49996 corresponding to an accuracy of 0.0026%. In Fig. [5] the 
extracted scaling slope is 0.91 meaning that the compute time scales as tcompute oc N~^^l^ 
in this case. This is a slightly different slope than in the Coulomb potential case. This is 
due to fluctuations in compute time due to server load and sporadic network delays. The 
scaling coefficient reported in the conclusions will be the average of all scaling coefficients 
extracted from the different potentials detailed in this paper. 

E. Dodecahedron Potential 

The previous two examples have spherical symmetry and hence it is not necessary to apply 
a fully 3d Schrodinger equation solver to them. We do so only in order to show scaling with 
computational nodes and percent error compared to analytically available solutions. As 
a nontrivial example of the broad applicability of the FDTD technique we apply it to a 
potential that is a constant negative value of = —100 inside a surface defined by a regular 
dodecahedron with the following 20 vertices 

(44^i)'(°'4'")'(4'"'°)'(4''''") • 
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FIG. 5: Compute time versus number of computational nodes on a log-log plot for the 3d harmonic 
oscillator potential specified in Eq. (j7.2p . The dashed line is a linear fit to the data. Scaling 
exponent indicates that, in this case, the compute time scales as tcomputc oc N~l^^l. Error bars are 
the standard error determined by averaging over 10 runs, cTg = a/^/N, where a is the standard 
deviation across the sampled set of runs and is the number of runs. 

where = (1 + a/5)/2 is the golden ratio. The value -1 is mapped to the point n = 1 and the 
value 1 is mapped to the point n = in all three dimensions. As a result, the containing 
sphere has a radius of \/3{N — l)/2(f). 

In Fig. [6] we show the ground and first excited states extracted from a run on a 128^ 
lattice with a lattice spacing of a = 0.1, an imaginary time step of Ar = 0.001 and particle 
mass of m = 1. On the left we show the ground state and on the right the first excited 
state. We find that the energies of these two levels are Eq = —99.78 and Ei = —99.55. 
Note that for the first excited state the position of the node surface can change during each 
run due to the random initial conditions used. In practice, the node surface seems to align 
along one randomly chosen edge of one of the pentagons that make up the surface of the 
dodecahedron. 

In Fig. [7] we show the amount of time needed in seconds to achieve convergence of the 
dodecahedron ground state binding energy to a part in 10^ as a function of the number of 
computational nodes for A'^nodes ^ {4,8, 16,32}. For this benchmark we used a 512^ lattice 
with a constant lattice spacing of a = 0.1, an imaginary time step of Ar = 0.001 and 
particle mass of m = 1. In Fig. [7] the error bars are the standard error determined by 
averaging over 10 runs, cTg = a/y/N, where a is the standard deviation across the sampled 
set of runs and N is the number of runs. In all cases the ground state energy obtained was 
Eq = —99.97. In Fig. [7|the extracted scaling slope is 0.91 meaning that the compute time 
scales as tcomputc oc N'^^^l in this case. 
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FIG. 6: Ground state (left) and first excited state (right) of a dodecahedron potential. Surfaces are 
constant probability density surfaces. For the ground state we show ip^ G {10~^^, 10~^, 10~^, 10~^} 
and for the first excited state G {10"''', 10~^, 10~^, 10~^}. Positive quadrant defined by x < || 
y > is cut out in order to view the interior of the wavefunctions. 

F. Applying Symmetry Constraints to the FDTD initial wavefunction 



One of the fundamental problems associated with using a single FDTD run to determine 
both the ground state and excited states is that typically the excited states are much more 
extended in space than the ground state, particularly for potentials with a "long range tail" 
like the Coulomb potential. For this reason it is usually difficult to obtain accurate energy 
eigenvalues for both ground and excited states unless the lattice has an extremely fine lattice 
spacing and a large number of points in each direction so that the dimension of the box is 
also large. In Sec. IVII CI we presented benchmarks for the Coulomb potential on a 512^ 
lattice that had a dimension of 25.6 Bohr radii. As we found in that section, we were able 
to determine the ground and first excited states to 0.2% and 2.4%. Improving the accuracy 
of the first excited state would require going to a lattice with dimensions larger than 512'^. 

While this is possible with the parallelized code, there is a more efficient way to find 
excited states by applying symmetry constraints to the initial wavefunction. For example, 
to find the Ip state of the Coulomb problem we can initialize the wavefunction as ^initial = 
-^z^random as discusscd in Sec. HVl In this case we explicitly project out the ground state 
wavefunction since it is symmetric about the z-axis. Applying this method on a 256^ lattice 
with lattice spacing a = 0.2 and imaginary time step At = 0.01 we find the first excited 
state energy to be Ei = —0.12507 which is accurate to 0.06%. At the same time we can 
extract the next excited state which is anti-symmetric about the 2;-axis (n = 2 state) finding 
in this case, E2 = —0.055360, corresponding to an accuracy of 0.4%. 

The application of symmetry constraints can also allow one to pick out states with differ- 
ent orientations in a 3d potential that breaks spherical symmetry. In Ref 4^ this technique 
was used to accurately determine the different heavy quarkonium p-wave states correspond- 
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FIG. 7: Compute time versus number of computational nodes on a log-log plot for the dodecahe- 
dron potential using the vertices defined in Eq. (j7.3p . The dashed line is a linear fit to the data. 
Scaling exponent indicates that, in this case, the compute time scales as ^compute oc N~^^l. Error 
bars are the standard error determined by averaging over 10 runs, ae = (t/^/N , where a is the 
standard deviation across the sampled set of runs and is the number of runs. 

ing to angular momentum = and = ±1. Therefore, the ability to constrain the 
symmetry of the initial FDTD wavefunction is a powerful technique. 



G. Application of the Multi-resolution Technique 

In this section we present benchmarks for the application of the multi-resolution technique 
to the Coulomb potential problem. The current version of the code supports this feature 
by allowing users the option of saving the wavefunction at the end of the run. The saved 
wavefunctions can then be read in and used as the initial condition for a subsequent run. The 
saved wavefunctions can have a different resolution than the resolution of the new run and 
the code automatically adjusts by sampling/spreading out the wavefunction appropriately. 

By using this technique we can accelerate the determination of the high accuracy energy 
eigenvalues and wavefunctions. In Sec. IVII CI we found that using 32 computational nodes 
and a random initial wavefunction a 512'^ run took approximately 1.3 hours. Scaling naively 
to a 1024^ lattice, while keeping the lattice volume fixed, would take approximately 42 
hours. Using the multi-resolution technique and bootstrapping from 128^ up to 1024^ a 
high resolution ground state and energy eigenvalue can be computed in approximately 45 
minutes using the same 32 computational nodes. At the final resolution of a = 0.025 and 
a lattice size of 25.6 Bohr radii the 1024'^ run gives Eq = —0.499632 which is accurate to 
0.07%. Therefore, the multi-resolution technique provides a performance increase of a factor 
of 50 compared to using random initial wavefunctions for all runs. 
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VIII. CONCLUSIONS AND OUTLOOK 



In this paper we have described a parallel FDTD algorithm for solving the 3d Schrodinger 
equation. We have shown that for large 3d lattices the method gives a compute time that 
scales as tcompute oc A^^des^^°'°''- This final scaling coefficient and associated error were 
obtained by averaging the three different scaling coefficients extracted for the Coulomb, 
harmonic oscillator, and dodecahedron potentials. The crucial optimization that allowed us 
to achieve nearly ideal scaling was the use of non-blocking sends/receives of the boundary 
data so that update of each node's sub-lattice can proceed while communication of the 
boundary information is taking place, providing for an "inside-out" update algorithm. 

Additionally we introduced two novel techniques that can be used in conjunction with 
the FDTD method. First, we discussed the possibility of imposing a symmetry constraint on 
the initial wavefunction used for the FDTD evolution. The imposed symmetry constraint 
allows us to easily construct states that are orthogonal to the ground state and/or some 
other excited states. Using this technique we can select states that have a certain symmetry, 
thereby allowing for extremely accurate determination of the particular states we are inter- 
ested in. Second, we introduced the "multi-resolution technique" which simply means that 
we use the FDTD output wavefunctions from lower-resolution runs as the initial condition 
for higher-resolution runs. Using this method we showed that we can efficiently "bootstrap" 
our way from small to large lattices, thereby obtaining high-accuracy wavefunctions and 
eigenvalues in a fraction of the time required when using random initial wavefunctions on 
large lattices. 

The code developed for this paper has been released under an open-source GPL license 



43[. An obvious next step will be to extend the code to fully 3d partitions, which is in 
progress. Other areas of improvement include adding support for different types of boundary 
conditions and complex potentials j46| . 
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